In this notebook, our focus is on persistent data — the kind that outlives a program that creates it.
That’s not true by default for objects a script constructs, of course; things like lists, dictionaries, and even class instance objects live in your computer’s memory and are lost as soon as the script ends.
To make data live longer, we need to do something special.
In Python there are (at least) five traditional ways to save information in between program executions:
Flat files
DBM keyed files
Pickled objects
Shelve files
SQL relational databases (RDBMSs)
For programs that can benefit from the power of SQL, Python also broadly supports relational database management systems (RDBMSs).
The databases we’ll meet in this notebook, though, are structured and processed in very different ways:
They store data in related tables of columns (rather than in persistent dictionaries of arbitrarily structured persistent Python objects).
They support the SQL query language for accessing data and exploiting relation- ships among it (instead of Python object traversals).
For some applications, the end result can be a potent combination. Moreover, some SQL-based database systems provide industrial-strength persistence support for enterprise-level data.
Today, there are freely available interfaces that let Python scripts utilize all common relational database systems, both free and commercial: MySQL, Oracle, Sybase, Informix, InterBase, PostgreSQL (Postgres), SQLite, ODBC, and more.
In addition, the Python community has defined a database API (aka DB API
) specification that works portably with a variety of underlying database packages.
Scripts written for this API can be migrated to different database vendor packages, with minimal or no source code changes.
As of Python 2.5, Python itself includes built-in support for the SQLite relational database system as part of its standard library.
Because this system supports the portable database API, it serves as a tool for both program storage and prototyping—systems developed with SQLite work largely unchanged when a more feature-rich database such as MySQL or Oracle is deployed.
Moreover, the popular SQLObject
and SQLAlchemy
third-party systems both provide an Object Relational Mapper (ORM), which grafts an object interface onto your database, in which tables are modeled by as Python classes, rows by instances of those classes, and columns by instance attributes.
The Python Database API (DB API) specification defines an interface for communicating with underlying database systems from Python scripts.
Vendor-specific database interfaces for Python may or may not conform to this API completely, but all database extensions for Python in common use are minor variations on a theme.
Under the database API, SQL databases in Python are grounded on three core concepts:
Connection Objects:
Represent a connection to a database, are the interface to rollback and commit operations, provide package implementation details, and generate cursor objects.
Cursor Objects:
Represent an SQL statement submitted as a string and can be used to access and step through SQL statement results.
Query results of SQL select statements:
Are returned to scripts as Python sequences of sequences (e.g., a list of tuples), representing database tables of rows. Within these row sequences, column field values are normal Python objects such as strings, integers, and floats (e.g.,
[('bob', 48), ('emily',47)]
). Column values may also be special types that encapsulate things such as date and time, and databaseNULL
values are returned as the PythonNone
object.
Beyond this, the API defines a standard set of database exception types, special database type object constructors, and informational top-level calls including thread safety and replacement style checks.
For instance, to establish a database connection under the Python API-compliant Oracle interface, install the commonly used Python Oracle extension module (i.e. pip install cx_oracle
) as well as Oracle itself, and then run a statement of this form:
connobj = connect("user/password@system")
This call’s arguments may vary per database and vendor (e.g., some may require network details or a local file’s name), but they generally contain what you provide to log in to your database system.
Once you have a connection object, there a variety of things you can do with it, including:
connobj.close() # close connection now (not at object __del__ time)
connobj.commit() # commit any pending transactions to the database
connobj.rollback() # roll database back to start of pending transactions
But one of the most useful things to do with a connection object is to generate a cursor object:
cursobj = connobj.cursor() # return a new cursor object for running SQL
Cursor objects have a set of methods, too (e.g., close to close the cursor before its destructor runs, and callproc to call a stored procedure), but the most important may be this one:
cursobj.execute(sqlstring [, parameters]) # run SQL query or command string
Parameters are passed in as a sequence or mapping of values, and are substituted into the SQL
statement string according to the interface module’s replacement target conventions.
The execute method can be used to run a variety of SQL
statement strings:
CREATE TABLE
);UPDATE
or INSERT
);SELECT
)After running an SQL statement, the cursor’s rowcount
attribute gives the number of rows changed (for DML changes) or fetched (for DQL queries), and the cursor’s description
attribute gives column names and types after a query;
execute
also returns the number of rows affected or fetched in the most vendor interfaces.
For DQL query statements, you must call one of the fetch
methods to complete the operation:
single_tuple = cursobj.fetchone() # fetch next row of a query result
list_of_tuple = cursobj.fetchmany([size]) # fetch next set of rows of query result
list_of_tuple = cursobj.fetchall() # fetch all remaining rows of the result
And once you’ve received fetch method results, table information is processed using normal Python sequence operations; for example, you can step through the tuples in a fetchall
result list with a simple for loop or comprehension expression.
Most Python database interfaces also allow you to provide values to be passed to SQL
statement strings, by providing targets and a tuple of parameters. For instance:
query = 'SELECT name, shoesize FROM spam WHERE job = ? AND age = ?'
cursobj.execute(query, (value1, value2))
results = cursobj.fetchall()
for row in results:
pass # do something
In this event, the database interface utilizes prepared statements (an optimization and convenience) and correctly passes the parameters to the database regardless of their Python types.
The notation used to code targets in the query string may vary in some database interfaces (e.g., :p1
and :p2
or two %s
, rather than the two ?s
used by the Oracle interface); in any event, this is not the same as Python’s
%
string formatting operator, as it sidesteps security issues along the way.
Finally, if your database supports stored procedures, you can call them with the callproc
method or by passing an SQL CALL
or EXEC
statement string to the execute method.
callproc
may generate a result table retrieved with a fetch
variant, and returns a modified copy of the input sequence — input parameters are left untouched, and output and input/output parameters are replaced with possibly new values.
Additional API features, including support for database blobs
(roughly, with sized results), is described in the API’s documentation.
For now, let’s move on to do some real SQL processing in Python.
We don’t have space to provide an exhaustive reference for the database API in this notebook.
To sample the flavor of the interface, though, let’s step through a few simple examples.
We’ll use the SQLite database system for this tutorial.
SQLite is a standard part of Python itself, which you can reasonably expect to be available in all Python installations. Although SQLite implements a complete relational database system, it takes the form of an in-process library instead of a server.
This generally makes it better suited for program storage than for enterprise-level data needs.
Thanks to Python’s portable DB API, though, other popular database packages such as PostgreSQL, MySQL, and Oracle are used almost identically; the initial call to log in to the database will be all that normally requires different argument values for scripts that use standard SQL code.
Because of this, we can use the SQLite system both as a prototyping tool in applications development and as an easy way to get started with the Python SQL database API in this book.
Regardless of which database system your scripts talk to, the basic SQL interface in Python is very simple.
In fact, it’s hardly object-oriented at all queries and other database commands are sent as strings of SQL.
Whether large or small, though, the Python code needed to process your database turns out to be surprisingly straightforward.
To get started, the first thing we need to do is open a connection to the database and create a table for storing records:
In [1]:
import sqlite3
conn = sqlite3.connect('data/dbase1')
We start out by importing the Python SQLite interface here— it’s a standard library module called sqlite3
to our scripts.
Next we create a connection object, passing in the items our database requires at start-up time—here, the name of the local file where our databases will be stored.
This file is what you’ll want to back up to save your database. It will create the file if needed, or open its current content; SQLite also accepts that special string :memory:
to create a temporary database in memory instead.
As long as a script sticks to using standard SQL code, the connect call’s arguments are usually the only thing that can vary across different database systems.
For example, in the MySQL interface this call accepts a network host’s domain name, user name, and password, passed as keyword arguments instead, and the Oracle example sketched earlier expects a more specific sting syntax.
Once we’ve gotten past this platform-specific call, though, the rest of the API is largely database neutral.
Next, let’s make a cursor for submitting SQL statements to the database server, and submit one to create a first table:
In [2]:
curs = conn.cursor()
try:
curs.execute('drop table people')
except:
pass # did not exist
curs.execute('create table people (name char(30), job char(10), pay int(4))')
Out[2]:
The last command here creates the table called “people” within the database; the name, job, and pay information specifies the columns in this table, as well as their datatypes, using a “type(size)” syntax — two strings and an integer.
Datatypes can be more sophisticated than ours, but we’ll ignore such details here.
In SQLite, the file is the database, so there’s no notion of creating or using a specific database within it, as there is in some systems.
There are three basic statement-based approaches we can use here:
Here is the simple case
In [3]:
curs.execute('insert into people values (?, ?, ?)', ('Bob', 'dev', 50000))
curs.rowcount
Out[3]:
In [4]:
sqlite3.paramstyle
Out[4]:
Here, qmark
means this module accepts ?
for replacement targets.
Other database modules might use styles such as format (meaning a %s
target), or numeric indexes or mapping keys; see the DB API for more details.
To insert multiple rows with a single statement, use the executemany
method and a sequence of row sequences (e.g., a list of lists). This call is like calling execute
once for each row sequence in the argument, and in fact may be implemented as such; database interfaces may also use database-specific techniques to make this run quicker, though:
In [5]:
curs.executemany('insert into people values (?, ?, ?)',
[ ('Sue', 'mus', '70000'),
('Ann', 'mus', '60000')])
curs.rowcount
Out[5]:
We inserted two rows at once in the last statement.
It’s hardly any more work to achieve the same result by inserting one row at a time with a Python loop:
In [6]:
rows = [['Tom', 'mgr', 100000], ['Kim', 'adm', 30000], ['pat', 'dev', 90000]]
for row in rows:
curs.execute('insert into people values (? , ?, ?)', row)
conn.commit()
Blending Python and SQL like this starts to open up all sorts of interesting possibilities.
Notice the last command; we always need to call the connection’s commit
method to write our changes out to the database. Otherwise, when the connection is closed, our changes may be lost.
In fact, until we call the commit
method, none of our inserts may be visible from other database connections.
Technically, the API suggests that a connection object should automatically call its rollback
method to back out changes that have not yet been committed, when it is closed (which happens manually when its close method is called, or automatically when the connection object is about to be garbage collected).
For database systems that don’t support transaction commit and rollback operations, these calls may do nothing. SQLite implements both the commit and rollback methods; the latter rolls back any changes made since the last commit.
In [7]:
curs.execute('select * from people')
for row in curs.fetchall():
print(row)
Tuple unpacking comes in handy in loops here, too, to pick out column values as we go.
Here’s a simple formatted display of two of the columns’ values:
In [8]:
curs.execute('select * from people')
for (name, job, pay) in curs.fetchall():
print(name, ':', pay)
Because the query result is a sequence, we can use Python’s powerful sequence and iteration tools to process it.
For instance, to select just the name column values, we can run a more specific SQL query and get a list of tuples
In [9]:
curs.execute('select name from people')
names = curs.fetchall()
names
Out[9]:
fetchall
vs fetchone
The fetchall
call we’ve used so far fetches the entire query result table all at once, as a single sequence (an empty sequence comes back, if the result is empty).
That’s convenient, but it may be slow enough to block the caller temporarily for large result tables or generate substantial network traffic if the server is running remotely (something could easily require a parallel thread in GUI).
To avoid such a bottleneck, we can also grab just one row, or a bunch of rows, at a time with fetchone
and fetchmany
.
The fetchone
call returns the next result row or a None
false value at the end of the table:
In [10]:
curs.execute('select * from people')
while True:
row = curs.fetchone()
if not row:
break
print(row)
In [11]:
curs.execute('select * from people')
for row in curs:
print(row)
The fetchmany
call returns a sequence of rows from the result, but not the entire table; you can specify how many rows to grab each time with a parameter or rely on the default as given by the cursor’s arraysize
attribute.
Each call gets at most that many more rows from the result or an empty sequence at the end of the table:
In [13]:
curs.execute('select * from people')
while True:
rows = curs.fetchmany() # size=N optional argument
if not rows:
break
for row in rows:
print(row)
In [14]:
curs.description
Out[14]:
In [15]:
curs.execute('select * from people')
colnames = [desc[0] for desc in curs.description]
for row in curs:
print('-' * 30)
for (name, value) in zip(colnames, row):
print('%s => %s' % (name, value))
In [ ]:
# %load files/makedicts.py
"""
convert list of row tuples to list of row dicts with field name keys
"""
import sqlite3
def makedicts(cursor, query, params=()):
cursor.execute(query, params)
colnames = [desc[0] for desc in cursor.description]
rowdicts = [dict(zip(colnames, row)) for row in cursor]
return rowdicts
conn = sqlite3.connect('data/dbase1')
cursor = conn.cursor()
query = 'select name, pay from people where pay < ?'
lowpay = makedicts(cursor, query, [70000])
for rec in lowpay:
print(rec)
In [18]:
!cat data/data.txt
In [19]:
!cat data/data2.txt